Windows HTTPD and TCP: A Report
Updated 30-Apr-95

Robert B. Denny
April 30, 1994

INTRODUCTION

Over the past 8 months, I have done quite a bit of research into the "win httpd" problems where the server appears to go to sleep or causes some sort of a system fault. Last Fall, I solved a couple of problems where GPFs would occur under rare circumstances during a connection abort/cancellation. Since then, the server itself has been stable.

I have performed extensive testing with Microsoft TCP32 3.11, 3.11a, and recently, the 3.11b beta, with Netmanage Chameleon 4.02 and with Trumpet 2.0B. All exhibit problems in handling connection aborts. Netscape issues more of these aborts, on average, because it can have multiple connections open when the user jumps off a link or hits the stop sign. This is NOT a bug in Netscape! However, this kind of use of TCP is something that, in my opinion, has not been tested on at least some of the Windows TCP packages.

BACKGROUND

First, it is important to understand that servers use a different set of functionality in a TCP package when compared to clients. Therefore, you can expect to see different sorts of problems with servers versus clients. Secondly, HTTP (the Web protocol) is unique in its high frequency of aborting connections. This is a normal part of HTTP, but it must be done with care, else the underlying TCP machinery can get out of whack. Finally, and probably most importantly, the TCP packages at each end of the connection must perform the protocol correctly. If either package screws up, it is possible for one or both ends to get out of whack. The network manifesto is "strict in what you do, permissive in what you'll accept". Nonetheless, some packages can get disturbed by misbehavior at the other end. Needless to say, this has far-reaching consequences for server operators...

In order to understand the specific causes of the observed problems, you need to understand a bit about TCP connection opening and closing. This is a long topic, and more than I can explain in detail here, so if you need more information I suggest you have a look at Douglas Comer's bible, Internetworking with TCP/IP (2 volumes).

NOTE: My explanations will take the view of a client/browser and a server, but the process is more general, and needn't follow the steps I describe.

Opening a Connection

The opening process is a 3-way handshake where the client sends a SYN packet to the server, the server sends a SYN/ACK back to the client ("I saw your SYN, and here's mine"), then the client sends an ACK back to the server ("OK, I saw your SYN too, we're all set"). At this point the connection is established.
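
As a rough illustration of the handshake from the program's point of view, here is a minimal sketch in Python using Berkeley-style sockets over the loopback interface. It is not taken from the Windows server, and the function name open_connection is made up for this example. The point it tries to show is that the SYN, SYN/ACK, ACK exchange is done by the TCP package itself; the server program only picks up the finished connection with accept().

```python
import socket


def open_connection():
    """Open one TCP connection over loopback; return (client, work, listener)."""
    # Server side: a listening socket on an ephemeral loopback port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(5)
    listener.settimeout(5)

    # Client side: a blocking connect() sends the SYN and returns once the
    # SYN/ACK has come back and the final ACK has been sent.
    client = socket.create_connection(listener.getsockname(), timeout=5)

    # The handshake was completed by the server's TCP stack. accept() only
    # hands the already-established "work" socket to the server program.
    work, _peer = listener.accept()
    return client, work, listener


if __name__ == "__main__":
    client, work, listener = open_connection()
    print("client", client.getsockname(), "<-> server", work.getsockname())
    for s in (client, work, listener):
        s.close()
```

Note that connect() succeeds before the server has called accept(). That gap is where a client can change its mind and kill a connection the server program has not yet seen.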

Closing a connection

Either end may close a connection. Normally, the server sends the response data and then closes the connection. The client reads until it sees the closure, at which time it assumes it has received the complete response. If the client wants to abort the connection, it is the client, not the server, that closes it.
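
The normal case can be sketched as follows, again in Python with Berkeley-style sockets rather than Winsock. The names serve_one, fetch and response_demo, and the toy request string, are assumptions made up for this example. The server sends everything and closes; the client keeps reading until a read returns nothing, which is how it sees the closure.

```python
import socket
import threading


def serve_one(listener, body):
    """Accept one connection, send the whole response, then close it."""
    conn, _peer = listener.accept()
    with conn:
        conn.recv(1024)      # read the (tiny) request; contents ignored here
        conn.sendall(body)   # queue the complete response
    # Leaving the with-block closes the socket. With default socket options
    # TCP still delivers the queued data before it sends the FIN.


def fetch(addr, request=b"GET /\r\n"):
    """Send a request and read until the server closes the connection."""
    chunks = []
    with socket.create_connection(addr, timeout=5) as s:
        s.sendall(request)
        while True:
            data = s.recv(4096)
            if not data:     # empty read: the server's close has been seen
                break
            chunks.append(data)
    return b"".join(chunks)


def response_demo(body=b"hello, web\n" * 1000):
    """Run one server thread and one client over loopback; return what arrived."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    listener.settimeout(5)
    t = threading.Thread(target=serve_one, args=(listener, body))
    t.start()
    try:
        return fetch(listener.getsockname())
    finally:
        t.join()
        listener.close()


if __name__ == "__main__":
    print(len(response_demo()), "bytes received")
```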

There are 2 ways to close a connection, the "hard" way and the "soft" way. When the server sends its response and closes the connection, it must do a soft close, wherein all unsent data is delivered to the client before the socket is actually closed. On the other hand, if the client wants to abort the connection, it should do a hard close, where unsent data is trashed and the connection is closed immediately.

In TCP terms a soft close is done by one end (A) sending a FIN message. The other end (B) sees the FIN, but it may have more to do, so it notes it and goes on. Eventually B is finished and sends a FIN back to A. Meanwhile, A is still alive responding to B so B can finish up. When A sees B's FIN, it knows it's all over and shuts down. B can shut down when it sends the FIN to A. This is actually a white lie: there are several interlock states in the TCP machine that the sockets go through on their way down, and they involve ACK exchanges and the using program calling closesocket().
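
The A/B exchange can be sketched with a half-close. This is only an illustration under assumptions: it uses Python's socket.shutdown(SHUT_WR) to make A send its FIN while still listening, and the names side_a, side_b and half_close_demo are invented for the example. The intermediate TCP states and ACKs mentioned above are handled by the stack and are not visible here.

```python
import socket
import threading


def side_b(listener, seen):
    """B: read until A's FIN, finish its own work, then close."""
    conn, _peer = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:       # A's FIN arrived: A will send nothing more
                break
            seen.append(data)
        # B notes the FIN but still has something to say; its own sending
        # direction is still open.
        conn.sendall(b"got %d bytes" % sum(len(d) for d in seen))
    # Closing the socket here sends B's FIN.


def side_a(addr, message):
    """A: send, announce 'I am done sending' with a FIN, keep receiving."""
    reply = b""
    with socket.create_connection(addr, timeout=5) as s:
        s.sendall(message)
        s.shutdown(socket.SHUT_WR)   # sends FIN; receiving still works
        while True:
            data = s.recv(4096)
            if not data:             # B's FIN: it is all over
                break
            reply += data
    return reply


def half_close_demo(message=b"all I have to say"):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    listener.settimeout(5)
    seen = []
    t = threading.Thread(target=side_b, args=(listener, seen))
    t.start()
    try:
        return side_a(listener.getsockname(), message)
    finally:
        t.join()
        listener.close()


if __name__ == "__main__":
    print(half_close_demo())
```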

A hard close is where A sends a RST packet to B. At that point, B says "forget it", dumps its unsent data on the floor and shuts down the socket (again through a few intermediate states). A does the same. End of story.
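
For comparison, a hard close can be sketched like this. It assumes the usual Berkeley-sockets convention that SO_LINGER with the option on and a linger time of zero turns close() into an abort. The helper names hard_close and hard_close_demo are made up, and the struct layout shown is the Linux/BSD one (two C ints); Winsock's linger structure uses two 16-bit fields instead.

```python
import socket
import struct


def hard_close(sock):
    """Abort the connection: discard unsent data and send RST instead of FIN."""
    # l_onoff=1, l_linger=0: close() becomes an abortive close.
    # "ii" matches struct linger on Linux/BSD; Winsock uses two u_shorts.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def hard_close_demo():
    """Return what the other end observes after a hard close."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    listener.settimeout(5)
    client = socket.create_connection(listener.getsockname(), timeout=5)
    work, _peer = listener.accept()
    work.settimeout(5)

    hard_close(client)               # A sends RST
    try:
        work.recv(4096)              # B finds out on its next socket call
        outcome = "no reset seen"
    except ConnectionResetError:
        outcome = "reset"
    finally:
        work.close()
        listener.close()
    return outcome


if __name__ == "__main__":
    print(hard_close_demo())
```

A soft close shows up at the other end as an ordinary end-of-data (an empty read); a hard close shows up as an error. A program that treats the two alike, or a TCP package that reports one as the other, will end up confused about the state of the connection.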

Microsoft TCP32 3.11/3.11a/3.11b-beta

The original 3.11 of TCP32 (aka "Wolverine") has problems when the server delivers large files. The effect is to cut off the end of the file. This was fixed in the 3.11a release. 3.11a and the recent beta of 3.11b have problems that result in an access violation in the VxD, which causes the machine to drop out of Windows, leaving it at the DOS prompt. This appears to have been fixed in the 3.11b release candidate.

Trumpet 2.0B

I have included the text of a message on alt.winsock from Peter Tattam, Trumpet's author. As you will see, he is aware of the problems and I believe he is right on in his suspicions. Unfortunately, as of this date, the shareware version of Trumpet has not been fixed. There was a 2.0E version out for a while, but results with it were inconclusive.

With this package, people have reported "orphan socket killed n" and "WASNotSock(n)" errors in the Trumpet TCPMAN log window. The cause, in my opinion, is this: The client hard-closes the socket almost immediately after starting the 3-way connection setup process. The server end's TCP has not completed the connect setup, and either the listening socket or the newly cloned "work" socket is killed before the server ever gets to it in the first place. I believe this is what is meant by "orphan" socket.

In the case where the listener socket itself is killed by the incoming RST, it's the end of the line for the server. It never again will receive any incoming connections because the listener socket, the one responsible for servicing incoming connection requests, has been killed by the client's RST.

If the client's RST comes in just a bit later, after the TCP has converted the incoming connection to the "work" socket, but before it completes the server's accept() call, it appears that the work socket is "orphan kill"ed but the server gets the FD_ACCEPT message, and tries to do an accept(). The accept may succeed, but attempts to do I/O to the socket fail and then the server tries to close it. Since the work socket was bad from the start, the attempt to close it generates the WASNotSock error.

If the RST comes in a bit later, the accept completes, the server gets an FD_CLOSE message and things terminate gracefully.
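
What a server can do about the middle case is limited, but it can at least avoid letting one dead work socket take the listener down with it. The sketch below is not the Windows server's code and does not use Winsock's message-based notifications; it is a plain blocking accept loop in Python, with the invented names serve and abort_then_retry_demo and a toy ping/pong exchange. It assumes a TCP package that still hands an aborted connection to accept() (some drop it silently instead), so every step after accept() is treated as something that can fail.

```python
import socket
import struct


def serve(listener, max_accepts, idle_timeout=1.0):
    """Accept up to max_accepts connections; return (served, aborted) counts."""
    served = aborted = 0
    listener.settimeout(idle_timeout)
    for _ in range(max_accepts):
        try:
            conn, _peer = listener.accept()
        except socket.timeout:
            break                 # nothing more waiting in the queue
        except OSError:
            aborted += 1          # the peer gave up before accept() finished
            continue
        try:
            conn.settimeout(idle_timeout)
            request = conn.recv(1024)
            if not request:
                raise ConnectionError("peer closed before sending a request")
            conn.sendall(b"pong")
            served += 1
        except OSError:
            aborted += 1          # reset or silent peer: drop only this socket
        finally:
            conn.close()          # the listener itself is never touched
    return served, aborted


def abort_then_retry_demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    addr = listener.getsockname()

    # Client 1 connects and hard-closes before the server calls accept().
    bad = socket.create_connection(addr, timeout=5)
    bad.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    bad.close()

    # Client 2 behaves normally; its request waits in the kernel's queue.
    good = socket.create_connection(addr, timeout=5)
    good.sendall(b"ping")

    counts = serve(listener, max_accepts=2)
    reply = good.recv(1024)
    good.close()
    listener.close()
    return counts, reply


if __name__ == "__main__":
    print(abort_then_retry_demo())
```

The design choice is simply that errors on a work socket are counted and forgotten, and only the listener's own timeout ends the loop. None of this helps if the TCP package kills the listener socket itself, which is the first case described above.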

Again, see the message from Peter Tattam at the end of this paper. He's on to the problems and he really wants to fix them.

Chameleon 4.02

This package responds to incoming RSTs during the connection process by getting out of whack and leaving the work socket in the SYN_RECEIVED state. It fails to see the RST, and sits there waiting for the ACK to its part of the SYN handshake. There is no way to get rid of the dead socket except to reboot the system, as far as I can tell. After a while, these dead sockets can pile up and fill the socket table. I may have seen one instance where the listener socket gets out of whack also. Believe me, this is a difficult scenario to diagnose for a TCP neophyte like me.

A more insidious error that Chameleon has, however, is its failure to honor the hard close call made by the browser. In this case, the browser forgets about the socket, thinking it successfully hard-closed it. Meanwhile, the server thinks the client issued a soft close and keeps sending to the client. Eventually the low-level buffers at the client end fill (the receive window closes), and the connection becomes stalled. Now the server sits there waiting for the client to start reading again, which it obviously isn't going to do. Eventually, the server's sanity timer goes off, and it closes the socket. However, on at least one TCP package (unnamed until I contact the authors) you cannot close a stalled socket and the server side gets wedged. I believe this is NOT the fault of the server side socket implementation, because the root of the problem is Chameleon's failure to honor the client's hard close request in the first place.
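
The stall and the sanity timer can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the server's actual timer: it uses a Python socket timeout as the "sanity timer", a client that connects and never reads as the stand-in for the forgotten socket, and the invented names send_with_sanity_timer and stalled_client_demo. When no more data can be queued for the timeout period, the sender gives up and aborts the connection.

```python
import socket
import struct


def send_with_sanity_timer(conn, chunk, max_bytes, stall_timeout):
    """Send up to max_bytes; return (bytes_sent, stalled). Always closes conn."""
    conn.settimeout(stall_timeout)
    sent = 0
    try:
        while sent < max_bytes:
            # send() waits once the peer's receive window and the local
            # send buffer are both full.
            sent += conn.send(chunk)
    except socket.timeout:
        # No progress for stall_timeout seconds: abort rather than wait
        # forever ("ii" is the Linux/BSD layout of struct linger).
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        return sent, True
    finally:
        conn.close()
    return sent, False


def stalled_client_demo(stall_timeout=0.5):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    listener.settimeout(5)
    # The client connects and then never reads, like a browser that has
    # forgotten about a socket it believes is closed.
    client = socket.create_connection(listener.getsockname(), timeout=5)
    work, _peer = listener.accept()
    try:
        # max_bytes is only a safety cap, far above any normal buffer size.
        return send_with_sanity_timer(work, b"x" * 65536, 1 << 30, stall_timeout)
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    sent, stalled = stalled_client_demo()
    print("queued", sent, "bytes before stalling:", stalled)
```

On a healthy TCP package the close after the timeout succeeds and the server moves on. The wedge described above is the case where that final close cannot complete.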

I have tested Netmanage's "Armadillo" beta, which has now been released as Chameleon 4.5. This package appears to be solid, and somewhat faster than the 4.0x versions.

Conclusions

I had to throttle my impulse to make this message far longer. Those of you who are TCP experts, please pardon my "agricultural" explanations; I really do understand the fine points (now), and I tried to get the gist of this down on paper in my limited time, and I necessarily had to gloss over nuances.

It's a cruel world out there. As most of you on comp.infosystems.www.providers know, I have spent a great deal of time trying to get the Windows Web server really reliable and capable of high performance. Keep in mind how few messages you have seen regarding the server's capabilities, adherence to HTTP, and features. Most of the traffic relates to problems traceable to the filthy hardware and software environment of the PC and the not-ready-for-prime-time TCP packages out there.

I DO NOT MEAN TO SINGLE OUT CERTAIN PACKAGES. Those were simply the ones I had the time to test with. I am sure there are others out there that have similar kinds of problems; for example, I do know that Novell LAN Workplace doesn't even _support_ the SO_LINGER socket option which is used to select the hard versus soft socket close. I am a whole lot smarter than I was six months ago, but I still feel pretty inexperienced.

REFERENCE: USENET Post from Peter Tattam, Trumpet Software.

-------- BEGIN INCLUDED MESSAGE --------

>Xref: netcom.com alt.winsock:26667
>Path: netcom.com!ix.netcom.com!howland.reston.ans.net!pipex!uunet!munnari.oz.au!newsroom.utas.edu.au!newsroom.trumpet.com.au!jimmy.trumpet.com.au!peter
>Newsgroups: alt.winsock,trumpet.bugs,panix.ppp-slip.general
>From: peter@trumpet.com.au (Peter R. Tattam)
>Subject: Re: Trumpet Winsock 2.0B problem
>Date: Wed, 21 Dec 1994 18:15:11 +1100
>Message-ID: 
>Lines: 80
>References:  <1994Dec16.104031.5526@alder.cc.kcl.ac.uk>
>Organization: Trumpet Software International Pty Ltd.
>X-Newsreader: Trumpet for Windows [Version 1.0 Rev B final beta #4]

[ long explanation of problem deleted (rbd) ]

The problems have been resolved to the best of our knowledge. One
problem was bad packets would somehow get through the IP header checksum
and end up grunging memory later on. The fix has been to religiously
check each packet for length, and headers for length. this is done both
at the SLIP & also at the IP layer. 

Another problem we resolved was if a valid packet got through these
sections and made it to the IP fragment reassembly code with an oversize
fragment (> 1500), memory would again be grunged. Again fixed.

Finally, in the beta program from 1.0A to 2.0, some of the listen/accept
code was remodelled in an attempt to make the winsock pass the WSAT
tests. This has introduced a number of bugs where TCP structures get
lost, and along with them IP buffers...this results in tangled buffer
queues and memory problems.  To tell if you are hitting this bug, the
telltale sign is that the IP buffers get lost, and sometimes listening
sockets just won't work properly. We are confident that we have got to
the root of this problem and are currently acid testing the latest
release. The winsock runs fine for at least a week in an full internet
server environment without any ill effects now. 

[edited]
One program eludes us - httpd. I'm going to throw this program under the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
microscope to see what's happening. The problem only manfests itself
when Netscape kills a connection before it is accepted by httpd I think.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[Right on! I came to the same conclusion last weekend!]
                        
Of course when it comes to other apps tramping all over windows, there's
not a lot we can do. The winsock data structures are required to be in
shared memory, and this carries a degree of risk. If the applications
are well behaved - no problems... but we know how many windows apps
aren't well behaved... :-)

Peter

--
Peter R. Tattam - Managing Director          P.Tattam@trumpet.com.au
Trumpet Software International Pty Ltd.
Phone: 61-02-450220     Fax: 61-02-450210

-- Bob <rdenny@netcom.com>
